Gross domestic product (GDP) is one of the most common indicators used to track the health of a nation’s economy. GDP per capita, calculated by dividing a country’s GDP by its total population, is generally accepted as a measure of the standard of living. This study examines the relationship between the average person’s living standard in a country and several other social indicators, with the aim of building a model that predicts GDP per capita reliably.
The data used in this study were obtained from the World Bank website and include GDP per capita, unemployment rates, urban population percentages, and high-tech exports of countries in 2019, along with middle school enrollment rates in 2000. Data for 2020 were available; however, the author chose not to base the study on them because that year was an economic and social outlier for the world and thus should not be used in a study that aims to understand these relationships under normal conditions.
Four predictor variables were chosen for this study. The yearly unemployment rate of a country was considered because a country is likely to be more successful fiscally with a larger workforce. The urban population percentage was also considered because economic growth usually comes with urbanization, and the two are positively correlated. High-tech exports were chosen because they indicate whether a country is industrialized and has therefore experienced substantial economic growth. Data for these three indicators are from the same year as the variable of interest, GDP per capita. Only the middle school enrollment rate is taken from the year 2000: with a time lag of 19 years, the people who were attending middle school then are now a major part of the workforce, the very class that generates most of the nation’s wealth.
After organizing the data and removing countries with missing values as well as non-country/territory observations, 71 countries remain in the dataset out of the 195 countries in the world (approximately 36.4%). We recognize that this is a considerable loss of data, which could introduce bias and reduce the generalizability of our findings.
| Country | GDP2019 | MidSchool2000 | Unemployment2019 | HighTech2019 | UrbanPopPercentage2019 |
|---|---|---|---|---|---|
| Korea, Rep. | 31846.218 | 92.63574 | 4.148000 | 153561173548 | 81.43000 |
| Ecuador | 6183.824 | 47.76897 | 3.968000 | 68045527 | 63.98600 |
| United Kingdom | 42330.118 | 94.68536 | 3.851000 | 78176113113 | 83.65200 |
| North America | 63344.078 | 86.91654 | 3.889744 | 188786437936 | 82.36169 |
| Zimbabwe | 1463.986 | 42.45882 | 4.954000 | 27810712 | 32.21000 |
| n | min | median | mean | max | sd |
|---|---|---|---|---|---|
| 71 | 411.5523 | 11611.42 | 22962.63 | 114704.6 | 23786.58 |
Our total sample size is 71 (Table 2). The mean GDP per capita is about 22,962.63, far greater than the median of 11,611.42, indicating that the distribution of GDP per capita is heavily right-skewed and may be affected by outlying observations, as can be seen in Figure 1. This is understandable because global wealth is not distributed evenly: some countries own significantly more wealth than others.
Figure 1. Distribution of the GDP per capita for individual countries in 2019
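The gap between the mean and the median in Table 2 can be turned into a rough skewness figure. The sketch below uses Pearson's second (median) skewness coefficient, which is a swapped-in summary chosen for illustration and not a statistic computed in the original analysis; the variable and function names are hypothetical.

```python
# Summary statistics for GDP per capita, copied from Table 2.
mean_gdp = 22962.63
median_gdp = 11611.42
sd_gdp = 23786.58


def pearson_median_skewness(mean, median, sd):
    """Pearson's second skewness coefficient: 3 * (mean - median) / sd."""
    return 3.0 * (mean - median) / sd


skewness = pearson_median_skewness(mean_gdp, median_gdp, sd_gdp)

# A positive value means the mean sits above the median (right skew).
print(f"Pearson median skewness: {skewness:.3f}")
```

A clearly positive value is consistent with the right skew described above, although this coefficient is only a coarse summary and does not replace the histogram.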
Figure 2 shows the distribution of unemployment rates in 2019, which is also right-skewed and has some extreme outliers at around 20-30%. In Figure 3, the middle school enrollment rate in 2000 has a left-skewed distribution; however, the tail is heavy, so the observations in it cannot be considered outliers.
Figure 2. Distribution of the unemployment rate for individual countries in 2019
Figure 3. Distribution of the middle school enrollment rate for individual countries in 2000
Figure 4. Distribution of the high-tech exports for individual countries in 2019
In Figures 4 and 5, the distribution of high-tech exports is extremely right-skewed with many outlying observations, whereas the urban population percentage is only slightly left-skewed with no obvious outliers.
Figure 5. Distribution of the urban population percentage for individual countries in 2019
Figure 6. Interactive Scatterplot for the GDP per capita in 2019 for individual countries against their urban population percentage in the same year. The red line is the best fit line. The blue curve is the Loess curve.
In Figure 6, the scatterplot shows a positive association between GDP per capita and the urban population percentage. Without implying any causal effect, this suggests that countries with a higher average standard of living tend to have a higher proportion of their people living in urban areas.
The scatterplot in Figure 7 suggests that the unemployment rate and GDP per capita are negatively correlated. More notably, the purple points cluster at the top whereas the yellow points lie mostly at the bottom. This implies that countries that had high middle-school enrollment rates in 2000 also have a higher standard of living 19 years later. This is better illustrated in Figure 8, where we also notice that an upward-curving line would fit the relationship better than a straight line.
Figure 7. Interactive Scatterplot for the GDP per capita in 2019 for individual countries against their unemployment rates of the same year. The red line is the best fit line. The blue curve is the Loess curve.
Figure 8. Interactive Scatterplot for the GDP per capita for individual countries against their middle school enrollment rates in the year 2000. The red line is the best fit line. The blue curve is the Loess curve.
Figure 9. Interactive Scatterplot for the GDP per capita 2019 for individual countries against their High-tech Exports. The red line is the best fit line. The blue curve is the Loess curve.
Since the exploratory analysis shows that the distribution of GDP per capita is right-skewed and has some outliers, we decided to transform the response to address this problem. We also recognize the danger of overfitting, so we do not use the Box-Cox procedure to optimize the transformation for this dataset, but rather choose a more “natural” transformation: the square root.
Figure 10. Distribution of GDP per capita in 2019 raised to 0.5, for individual countries
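To give a feel for how much the square root pulls in the upper tail, the sketch below applies it to the minimum, median, and maximum reported in Table 2. It works only from those three published summary values, not from the full dataset, and the variable names are illustrative.

```python
import math

# Minimum, median, and maximum GDP per capita, copied from Table 2.
gdp_min, gdp_median, gdp_max = 411.5523, 11611.42, 114704.6

# The square root is monotone increasing, so order statistics map directly:
# with n = 71 (odd), the median of the transformed data is sqrt(median).
sqrt_min, sqrt_median, sqrt_max = (
    math.sqrt(v) for v in (gdp_min, gdp_median, gdp_max)
)

# Ratio of the largest value to the median, before and after transforming.
ratio_raw = gdp_max / gdp_median
ratio_sqrt = sqrt_max / sqrt_median

print(f"raw scale:  max/median = {ratio_raw:.2f}")
print(f"sqrt scale: max/median = {ratio_sqrt:.2f}")
```

The maximum sits much closer to the median on the square-root scale, which is the compression of the right tail that the transformation is meant to achieve.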
Using the following model:
## lm(formula = GDP2019_transf ~ HighTech2019 + ns(UrbanPopPercentage2019,
## df = 3) + ns(MidSchool2000, df = 3) + ns(Unemployment2019,
## df = 3), data = tidy_joined_dataset)
We kept the high-tech exports variable linear because it is extremely right-skewed and does not have the spread needed for flexible alternatives (such as natural splines or polynomials). For every other variable, namely unemployment rates, middle-school enrollment rates, and urban population percentage, we used natural splines. The number of knots used is 4, chosen in view of the sample size (<100).
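The bookkeeping behind this model specification can be sketched as follows. The knot rule is an assumption based on the documented default behaviour of R's `splines::ns()` without an intercept column (`df - 1` interior knots plus 2 boundary knots); the helper name `knots_for_ns` is hypothetical.

```python
# Degrees of freedom per model term, read off the model formula.
term_df = {
    "HighTech2019": 1,                        # linear term
    "ns(UrbanPopPercentage2019, df = 3)": 3,  # natural spline
    "ns(MidSchool2000, df = 3)": 3,           # natural spline
    "ns(Unemployment2019, df = 3)": 3,        # natural spline
}
n_obs = 71  # countries remaining after cleaning


def knots_for_ns(df):
    """Return (interior, total) knot counts for a natural spline basis.

    Assumption: df - 1 interior knots plus 2 boundary knots, as for
    R's ns() when no intercept column is requested.
    """
    interior = df - 1
    return interior, interior + 2


model_df = sum(term_df.values())   # slope-type parameters
resid_df = n_obs - model_df - 1    # minus 1 for the intercept

print("knots per spline (interior, total):", knots_for_ns(3))
print("model df:", model_df, "residual df:", resid_df)
```

Under these assumptions each spline has 4 knots in total, and the model and residual degrees of freedom agree with the 10 and 60 reported with the F-statistic.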
After the square root transformation, the diagnostic plots are more promising, though not perfect. In Figures 11, 12, and 13, the normal Q-Q plot shows an almost straight line and the distribution of the error terms is more symmetric; however, the residual scatterplot does appear to violate the homoscedasticity assumption.
Figure 11. Normal Q-Q plot for the square root of GDP per capita in 2019
In Table 3, the GVIF value for the variable with 1 degree of freedom and the GVIF^(1/(2*Df)) values for the variables with more than 1 degree of freedom are all between 1 and 5. This indicates at most moderate correlation among the predictor variables. Since there is little multicollinearity, the statistical power of the model is not greatly reduced, and we can proceed with the desired analysis.
| | GVIF | Df | GVIF^(1/(2*Df)) |
|---|---|---|---|
| HighTech2019 | 1.181734 | 1 | 1.087076 |
| ns(UrbanPopPercentage2019, df = 3) | 2.175815 | 3 | 1.138336 |
| ns(MidSchool2000, df = 3) | 4.056648 | 3 | 1.262878 |
| ns(Unemployment2019, df = 3) | 2.894752 | 3 | 1.193810 |
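The last column of Table 3 follows directly from the first two. As a minimal check, the sketch below recomputes GVIF^(1/(2*Df)) from the GVIF and Df values copied from the table; the dictionary and function names are illustrative only.

```python
# (GVIF, Df) pairs copied from Table 3.
gvif_table = {
    "HighTech2019": (1.181734, 1),
    "ns(UrbanPopPercentage2019, df = 3)": (2.175815, 3),
    "ns(MidSchool2000, df = 3)": (4.056648, 3),
    "ns(Unemployment2019, df = 3)": (2.894752, 3),
}


def adjusted_gvif(gvif, df):
    """Scale a generalized VIF by its degrees of freedom: GVIF^(1/(2*Df))."""
    return gvif ** (1.0 / (2.0 * df))


adjusted = {
    term: adjusted_gvif(gvif, df) for term, (gvif, df) in gvif_table.items()
}

for term, value in adjusted.items():
    print(f"{term}: {value:.6f}")
```

The adjustment makes terms with different degrees of freedom comparable, which is why the spline terms are judged on the last column and not on the raw GVIF.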
Our model is the following:
## lm(formula = GDP2019_transf ~ HighTech2019 + ns(UrbanPopPercentage2019,
## df = 3) + ns(MidSchool2000, df = 3) + ns(Unemployment2019,
## df = 3), data = tidy_joined_dataset)
Given the nature of splines, the individual model coefficients cannot be interpreted in the usual way: the basis terms of a spline all change together with the underlying variable, so holding all else unchanged while varying a single term is not possible when predicting the average square root of GDP per capita. Instead, we focus on the coefficients and their relative significance, and on the ANOVA table analysis section.
The coefficient p-values in Table 4 show that the third urban population percentage term and the first and third middle-school enrollment terms have p-values < 0.05. The first urban population percentage term (0.0854) and the first unemployment term (0.0961) are significant only at the 0.10 level, and high-tech exports is not significant, with a p-value of 0.3708.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 53.4012 | 28.5225 | 1.8722 | 0.0660 |
| HighTech2019 | 0.0000 | 0.0000 | 0.9017 | 0.3708 |
| ns(UrbanPopPercentage2019, df = 3)1 | 43.7492 | 25.0161 | 1.7488 | 0.0854 |
| ns(UrbanPopPercentage2019, df = 3)2 | 11.6767 | 51.7280 | 0.2257 | 0.8222 |
| ns(UrbanPopPercentage2019, df = 3)3 | 77.3089 | 25.0750 | 3.0831 | 0.0031 |
| ns(MidSchool2000, df = 3)1 | 86.5470 | 26.6525 | 3.2472 | 0.0019 |
| ns(MidSchool2000, df = 3)2 | 75.1719 | 78.5516 | 0.9570 | 0.3424 |
| ns(MidSchool2000, df = 3)3 | 148.4772 | 25.9431 | 5.7232 | 0.0000 |
| ns(Unemployment2019, df = 3)1 | -54.5920 | 32.2892 | -1.6907 | 0.0961 |
| ns(Unemployment2019, df = 3)2 | 47.3276 | 68.1545 | 0.6944 | 0.4901 |
| ns(Unemployment2019, df = 3)3 | 46.1235 | 39.6449 | 1.1634 | 0.2493 |
| | Value | df |
|---|---|---|
| Residual Standard Error | 43.042 | 60 |
| Multiple R-squared | 0.714 | |
| Adjusted R-squared | 0.666 | |
| | Value | Numerator df | Denominator df |
|---|---|---|---|
| Model F-statistic | 14.97 | 10 | 60 |
| P-value | 6.338e-13 | | |
However, what is important is that the model as a whole is useful. With an adjusted R-squared of 0.666, the model explains a large share of the variability in the square root of GDP per capita. Coupled with the significance of several predictors and the low model p-value of 6.338e-13, this leads us to believe the model is helpful in its explanatory ability.
From the ANOVA table in Table 6, High-tech Exports, with 1 degree of freedom, adds 10866.301 to the sum of squares. With an F value of 5.8654 and a p-value of 0.0185, we conclude that High-tech Exports alone in the model explains a significant amount of variability. Because this test is sequential, it does not adjust for the other predictors, which is why it differs from the coefficient test in Table 4.
The Urban Population Percentage variable, with 4 knots and 3 degrees of freedom, adds a further 163412.005 to the sum of squares. With an F value of 29.4020 and a p-value that rounds to 0.0000, we conclude that the Urban Population Percentage variable, given that High-tech Exports is in the model, is statistically significant.
The Middle-school Enrollment Rates variable, with 4 knots and 3 degrees of freedom, adds a further 94844.386 to the sum of squares. With an F value of 17.0649 and a p-value that rounds to 0.0000, we conclude that the Middle-school Enrollment Rates variable, given that High-tech Exports and Urban Population Percentage (with 4 knots) are in the model, is statistically significant.
The Unemployment Rates variable, with 4 knots and 3 degrees of freedom, adds a further 8154.963 to the sum of squares. With an F value of 1.4673 and a p-value of 0.2325, we conclude that the Unemployment Rates variable, given that High-tech Exports, Urban Population Percentage, and Middle-school Enrollment Rates (the latter two with 4 knots each) are in the model, is not statistically significant.
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| HighTech2019 | 1 | 10866.301 | 10866.301 | 5.8654 | 0.0185 |
| ns(UrbanPopPercentage2019, df = 3) | 3 | 163412.005 | 54470.668 | 29.4020 | 0.0000 |
| ns(MidSchool2000, df = 3) | 3 | 94844.386 | 31614.795 | 17.0649 | 0.0000 |
| ns(Unemployment2019, df = 3) | 3 | 8154.963 | 2718.321 | 1.4673 | 0.2325 |
| Residuals | 60 | 111156.967 | 1852.616 | NA | NA |
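The model-level summary figures can be recovered from the sums of squares in the ANOVA table. The sketch below is a consistency check under the usual least-squares definitions, using only the Df and Sum Sq values copied from the table; all names are illustrative.

```python
import math

# (Df, Sum Sq) per term, copied from the ANOVA table.
anova_rows = {
    "HighTech2019": (1, 10866.301),
    "ns(UrbanPopPercentage2019, df = 3)": (3, 163412.005),
    "ns(MidSchool2000, df = 3)": (3, 94844.386),
    "ns(Unemployment2019, df = 3)": (3, 8154.963),
}
resid_df, resid_ss = 60, 111156.967

model_df = sum(df for df, _ in anova_rows.values())
model_ss = sum(ss for _, ss in anova_rows.values())
total_ss = model_ss + resid_ss
n_obs = model_df + resid_df + 1  # plus 1 for the intercept

resid_ms = resid_ss / resid_df                 # residual mean square
rse = math.sqrt(resid_ms)                      # residual standard error
r_squared = model_ss / total_ss
adj_r_squared = 1.0 - (1.0 - r_squared) * (n_obs - 1) / resid_df
model_f = (model_ss / model_df) / resid_ms     # overall F-statistic

# Sequential F value for each term: its mean square over the residual one.
term_f = {
    term: (ss / df) / resid_ms for term, (df, ss) in anova_rows.items()
}

print(f"RSE = {rse:.3f}, R^2 = {r_squared:.3f}, adj R^2 = {adj_r_squared:.3f}")
print(f"model F = {model_f:.2f} on {model_df} and {resid_df} df")
```

The recomputed values agree with the reported residual standard error, R-squared values, and model F-statistic to the printed precision.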
The 95% prediction intervals: for a country with 0% urban population, an unemployment rate of 5%, a middle-school enrollment rate of 60% in 2000, and $10,000,000 worth of high-tech exports, the predicted square root of GDP per capita is 98.09265, with a lower limit of -18.23444 and an upper limit of 214.4197.
Holding the other values fixed (an unemployment rate of 5%, a middle-school enrollment rate of 60% in 2000, and $10,000,000 worth of high-tech exports), the prediction interval table below also shows the predicted square root of GDP per capita for urban population percentages of 40, 50, 60, and 70%.
| UrbanPopPercentage2019 | Point Estimate | Lower Limit | Upper Limit |
|---|---|---|---|
| 0 | 98.09265 | -18.23444 | 214.4197 |
| 40 | 56.62910 | -35.61627 | 148.8745 |
| 50 | 56.75335 | -35.23696 | 148.7437 |
| 60 | 66.94692 | -23.97692 | 157.8708 |
| 70 | 86.77952 | -4.18649 | 177.7455 |
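Because the intervals above are on the square-root scale, it may help to see them in GDP per capita units. The sketch below squares each value from the table; clamping negative lower limits to 0 before squaring is an assumption made here (a square root cannot be negative), not a step taken in the original analysis, and the function name is hypothetical.

```python
# Rows of the prediction interval table, on the square-root scale:
# (urban population %, point estimate, lower limit, upper limit).
intervals_sqrt = [
    (0, 98.09265, -18.23444, 214.4197),
    (40, 56.62910, -35.61627, 148.8745),
    (50, 56.75335, -35.23696, 148.7437),
    (60, 66.94692, -23.97692, 157.8708),
    (70, 86.77952, -4.18649, 177.7455),
]


def back_transform(value):
    """Square a square-root-scale value, clamping negatives to 0 first.

    Assumption: a negative limit is treated as 0, since the square
    root of GDP per capita cannot be negative.
    """
    return max(value, 0.0) ** 2


intervals_gdp = [
    (urban, back_transform(point), back_transform(lower), back_transform(upper))
    for urban, point, lower, upper in intervals_sqrt
]

for urban, point, lower, upper in intervals_gdp:
    print(f"urban {urban:>2}%: {point:9.2f}  [{lower:.2f}, {upper:.2f}]")
```

On this reading every lower limit becomes 0, which shows how wide the intervals are. The squared point estimate is better read as a typical (median-like) value than as a mean on the original scale.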
We recognize that interpretability sometimes has to be traded for a better model. Our analysis shows that the proposed model seems helpful, as it explains a good share of the variability in the square root of GDP per capita in 2019 (adjusted R-squared of 66.6%).
This project is limited by the data available. The chosen combination of indicators reduced the usable countries to only 36.4% of all countries, because any country with a missing value in any of the variables was excluded. Additionally, there were some notable outliers and high-leverage points that could not be removed because they are not errors but legitimate observations.
The choice of a non-linear model made the interpretation of the relationships between the variables more complex and less straightforward, a trade-off of which the author is well aware.
The study did not include any test for overfitting, so we do not know how the proposed model will perform outside of this dataset.